Computer simulations of language change notes

This website collects my personal notes on Computer simulations of language change. These notes are provided to bring full transparency to my research process. Of course, since they are only notes, they do not reflect my final thoughts on a topic, and should not be interpreted as such. To read finished papers, please consult my website. Do not use these notes as a basis for your own scientific research. Start from high-quality, peer-reviewed scientific literature instead.

Frequency and the emergence of linguistic structure

p. 1

Introduction to frequency and the emergence of linguistic structure

Joan Bybee and Paul Hopper

Introduction

From independent structure to usage-based structure

In contrast, outside linguistics it is widely held that cognitive representations are highly affected by experience. In humans and non-humans detailed tracking of probabilities leads to behavior that promotes survival (Kelly and Martin 1994).

↓

irregular morphological formations with high frequency are less likely to regularize
regular patterns have a wider range of applicability
high frequency phrases undergo special reduction

↳ Zipf

catalogued and described these effects
today: known for ‘Zipf’s law’

Zipf coined the term “dynamic philology” for the quantitative study of language change and its relevance for linguistic structure.

p. 2

1980s hypothesis

grammar comes about through the repeated adaptation of forms to live discourse (Hopper 1979; Givón 1979; Givón (ed.) 1983; Hopper and Thompson 1980, 1984; Du Bois 1985)
also: how can experience with language (reflected in frequency) affect cognitive representations and categorisation?

Time and again the operation of linguistic rules has been found to be limited by lexical constraints, sometimes to the point where a construction is valid only for one or two specific words.

emergence
(Hopper 1987, 1998, 1988, 1993)

ongoing process of structuration (Giddens 1984)
- “the conditions which govern the continuity and dissolution of structures or types of structures” (Giddens 1977: 120)
emergent structures are unstable and are manifested stochastically
⇒ the fixing of linguistic groups of all kinds as recognizably structural units is an ongoing process

p. 2-3

“Grammar” itself and associated theoretical postulates like “syntax” and “phonology” have no autonomous existence beyond local storage and real-time processing (Hopper 1987; Bybee, this volume).

p. 3

Contents of the volume

Two major principles

The distribution and frequency of the units of language are governed by the content of people’s interactions, which consist of a preponderance of subjective, evaluative statements, dominated by the use of pronouns, copulas and intransitive clauses.
The frequency with which certain items and strings of items are used has a profound influence on the way language is broken up into chunks in memory storage, the way such chunks are related to other stored material and the ease with which they are accessed.

p. 4

Patterns of use in natural discourse

Use of natural discourse data

mismatch

there is a very serious mismatch between the results of quantitative studies and grammatical accounts

p. 5

[Hopper and Thompson] note that lexical frames for verbs that specify their possible argument structures in advance of usage are often violated in practice, and that the more frequent a verb type, the less predictable the number of arguments; a rare verb like to elapse is limited to a single argument, whereas a common verb like to get appears in discourse with one, two, or three of the traditional arguments depending on the speaker’s need. Scheibman, arguing for the centrality of subjective expression in conversational English, points out that this role of subjectivity is in opposition to the privileging of referential language in standard linguistic analysis.

Subjectivity

How to work with discourse in frequency studies?

contextual meaning is irrelevant	contextual meaning is very important
morphological and phonological questions	other questions

p. 7

In another paper influenced by Bybee’s “Usage-based Phonology,” [which one???] Bush studies the palatalization of segments across word boundaries in, for example, “would you” > [wudju] as opposed to the absence of such palatalization in sequences such as “good you” (which had been noted by earlier researchers). Bush invokes transitional probability, the degree of likelihood that one word will be followed by a specific collocate. He concludes that the discourse “chunking” of lexical words creates units that may behave in every respect like unitary words, permitting the application of processes that are otherwise word-internal (see Bybee 2000a). His study indicates that frequency of cooccurrence significantly drives assimilation whether words are function or content words. Palatalization in conversation is not restricted to the pronoun you as suggested by some studies, nor is it possible to predict its occurrence with reference to constituent structure. Pairs of words that are frequently used together, whatever their apparent constituency and status as lexical or grammatical (don’t you, told you, that you, last year), are more likely to show effects of coarticulation than words that are used together less often.

p. 8

Units of usage

units of storage?

are the units of usage!
as people do not speak in isolated morphemes or words, […] the units of memory and processing contain multiple morphemes and even multiple words (see Wray and Perkins 2000)

categorisation of units

network based on the user’s experience (Bybee 1998)

p. 9

↓

network ontology

organized into exemplars on the basis of high similarity of phonetic shape and function or meaning
such exemplars are tagged for their contextual associations
- both linguistic and extra-linguistic

↳ evidence

both direct and indirect frequency effects can be demonstrated for these units

‘strength’

tokens of experience strengthen stored exemplars (Bybee 1985; Pierrehumbert, this volume)

p. 10

Frequency effects and cognitive mechanisms in emergent grammar

The notion of emergent structure has become important in various branches of the sciences in the last two decades. The basic idea is that what may appear to be a coherent structure created according to some underlying design may in fact be the result of multiple applications or interactions of simple mechanisms that operate according to local principles and create the seemingly well-planned structure as a consequence.

Phonological reduction in high frequency words and strings

reduction effect
(Schuchardt 1885)

words of higher frequency tend to undergo sound change at a faster rate than words of lower frequency

↳ graduality of sound change

both phonetically and lexically
specific phonetic features are associated with lexical items

p. 11

advantage of the exemplar model

allows distinct representation of forms (⟷ non-exemplar models)
you need an exemplar-based storage model of cognition to account for these types of changes

The origins of reduction are in the automatization of neuro-motor sequences which comes about with repetition. [this is contested]

why frequent words more often?

they are exposed to […] on-line processes [(automatisation)] more than infrequent words

discourse factors

the speaker seems to be able to gauge how much phonetic information the hearer needs in order to access the correct word (👁 ↓)

Where the word is primed by the other words in the context, it is also easier to access. The persistent use of this strategy by speakers leads to the development of a listener strategy by which reduced words are judged to be repetitions and thus part of the background in the discourse (Fenk-Oczlon, this volume). Thus with the reduction the speaker signals that the reduced word is just the same old word as used before, not a new one.

p. 12

The paper by Jurafsky et al. (this volume) takes into account a number of factors under the Probabilistic Reduction Hypothesis, which includes not just the predictability of a word within a particular discourse, but also its cumulative token frequency and the probability of a word given neighboring words.

Jurafsky et al. provide useful formulae for calculating the predictability of a word given the previous and following word. They study the top ten most frequent words of English, which are all function words (a, the, in, of, to, and, that, I, it, you). These words both show more vowel reduction and shorter duration as they are more predictable from the preceding and following word. In contrast, content words ending in /t/ or /d/ were studied for the deletion of their final consonant and here they find that only the frequency of the word containing the /t/ or /d/ predicts the rate of deletion.

p. 13

↓

nature of mental representation

Functional change due to high frequency

grammaticisation

functional and semantic change due to high frequency
the mechanism by which structure emerges from language use

p. 14

Frequency and the formation of constructions

constituent structure

determined by frequency of co-occurrence (Bybee and Scheibman 1999)
the more often two elements occur in sequence the tighter will be their constituent structure

Clear examples are cases in which two words have fused because of their frequent co-occurrence and now behave essentially as single words:

want to > wanna
going to > gonna
I am > I’m
can not > can’t
do not > don’t
I don’t know > I dunno
would have > would’ve

(Boyland 1996; Bybee and Scheibman 1999; Krug 1998, this volume).

↳ constituent boundaries

lost as frequency rises

p. 16

Frequency and accessibility

speed of lexical access and frequency

strongly linked!
so: frequency of use may make access of larger units easier as well

Strings such as you and I, come on, fall over, and common sequences with liaison in French, such as mes amis ‘my friends’, c’est un ‘it’s a’, and l’un avec l’autre ‘with one another’ may be more efficiently accessed as units than composed morpheme by morpheme.

p. 17

Retention of conservative properties in high frequency units

Two types of change for high frequency units
reductive processes	analogical change
due to language use	due to analogy
highly eligible	highly conservative

↓

different types, different susceptibility

For linguistic theory the major consequence of the finding that high frequency units are resistant to reformation on the basis of productive patterns is that the resistant units must have storage in memory in order to resist change and in order to be affected by frequency of use.

p. 18

Stochastic grammar

stochastic grammar

‘variability’ of grammatical structure

p. 19

↓

grammar

not fixed! → intrinsically variable

Conclusion

skipped

p. 123

Lexical diffusion, lexical frequency, and lexical analysis

Lexical analysis and lexical frequency

“High-frequency words form more distant lexical connections than low-frequency words. In the case of morphologically complex words . . . high-frequency words undergo less analysis, and are less dependent on their related base words than low-frequency words” (Bybee 1985: 118)

Phillips (1998: 231) on lexical diffusion
changes which require analysis	changes which eliminate or ignore grammatical information
affect the least frequent words first	affect the most frequent words first

↳ a modification of an earlier Frequency-Actuation Hypothesis (Phillips 1984: 336)

p. 124

↓

this paper: a further refinement of the hypothesis (“Frequency-Implementation Hypothesis”)

Frequency, analysis and sound change

Why do some stress shifts affect the least frequent words first, whereas others affect the most frequent words first? (Phillips 1998)

(I don’t know what’s going on here)

p. 128

Lexical analysis and word class

word class

needs to be treated as an independent factor in sound change

(again, all kinds of things are going on)

p. 132

Is there any reason why this should be the case, that is, that word frequency effects are felt inside of word classes? The answer may be because speakers access word class before they access phonological structure. As van Turennout et al. (1998: 572) observe, “data from behavioral studies as well as from neuropsychological studies of patients with language impairment have suggested that a word’s semantic and syntactic properties are retrieved before its phonological form is constructed.”

[(I’m finding this really hard to believe)]

p. 134

Conclusion

sound change is influenced by…

word frequency
word class
neighbourhood density

↳ rich lexicon

rich in detail and in interconnections

It does seem that the factor of neighborhood density must be incorporated into a psychologically real model of the lexicon and the effect of sound change upon that lexicon.
And it does seem that in determining which words are affected first in a sound change, word class takes precedence over word frequency [(again, I can hardly believe this)].
Finally, within word classes, sound changes which require fine analysis of the lexical entry (including neighborhood density effects, word class, morphological make-up, as well as phonotactic constraints and typological sound changes in general) affect the least frequent words first.

In brief, the Frequency-Implementation Hypothesis does hold: “Changes which require analysis — whether syntactic, morphological, or phonological — during their implementation affect the least frequent words first; others affect the most frequent words first.”

p. 137

Exemplar dynamics: Word frequency, lenition and contrast

Introduction

Phonological detail is saved

phonological detail in usage-based grammar

built up through experience with speech
level of detail: specific words in the lexicon of a given dialect

Challenges to standard models of phonology and phonetics

1. no possibility for differing phonetic realisations

standard model: lexicon and phonology are separate
- phonetic implementations are computed from phonological rules
however: no room for word-specific distributions

2. differential phonetic outcomes relate to word frequency

the intrusion of word frequency into a traditional area of linguistics is not accommodated

Storage of phonetic material

completely idiosyncratic storage?

here: phonetic form is stored in an isolated manner, independently

↕

reality

though word-specific phenomena exist, there are connections at lower levels
subparts do exist!

↳ the correct model must describe the interaction of word-specific phonetic detail with more general principles of phonological structure

Goal of the paper

Develop a formal architecture which is capable of capturing these regularities

(some general goals)

exemplar theory

a psychological model of similarity and classification

p. 140

Exemplar theory

How does the exemplar model work?

exemplar workings

each category is represented in memory by a large cloud of remembered tokens of that category
organized in a cognitive map
- memories of highly similar instances are close to each other
- memories of dissimilar instances are far apart
⇒ display the range of variation

For example, the remembered tokens of the vowel /ɛ/ would exhibit a variety of formant values (related to variation in vocal tract anatomy across speakers, variation along the dimension of hypo-hyperarticulation, and so forth) as well as variation in f0 and in duration. The entire system is then a mapping between points in a phonetic parameter space and the labels of the categorization system.

It is important to note that the same remembered tokens may be simultaneously subject to more than one categorization scheme, under such a model.

↓ consequence

frequent categories	infrequent categories
represented by numerous tokens	represented by fewer tokens

The mind’s capacity for long-term memories of individual examples is in fact astonishingly large, as experiments reviewed in Johnson (1996) indicate.

How the multitude of exemplars is managed

1. memories decay

memories of utterances that we heard yesterday are more vivid than memories from a decade ago
exemplars encoding frequent recent experiences have higher resting activation levels than exemplars encoding infrequent and temporally remote experiences

p. 141

2. granular parameter space

examples whose differences are too fine to show up under the granularization are encoded as identical (see Kruschke 1992)

For example, the ear cannot distinguish arbitrarily fine differences in f0. The JND (just noticeable difference) for f0 in any given part of the range is determined by the resolution of the anatomical and neural mechanisms which are involved in encoding f0. Thus, it is reasonable to suppose that speech tokens differing by less than one JND in f0 are stored as if they had identical f0s.
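The granularization idea can be sketched in a few lines. This is only a minimal illustration: it assumes a constant JND of 1 Hz (a made-up value, since the notes say the JND depends on the part of the range), a fixed grid of cells, and hypothetical names `cell_index` and `store_tokens`.

```python
from collections import Counter

JND_HZ = 1.0  # hypothetical constant JND; the real JND varies across the f0 range


def cell_index(f0_hz, jnd=JND_HZ):
    """Return the index of the JND-wide cell that an f0 value falls into."""
    return int(f0_hz // jnd)


def store_tokens(tokens_hz, jnd=JND_HZ):
    """Count tokens per cell; tokens that share a cell are encoded as identical."""
    return Counter(cell_index(t, jnd) for t in tokens_hz)


# Four f0 measurements in Hz, invented for illustration.
cloud = store_tokens([120.2, 120.7, 121.4, 135.0])
print(cloud)
```

Note that a fixed grid is a simplification: two tokens less than one JND apart can still straddle a cell boundary and end up in different cells.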

Classification of new tokens

Similarity

similarity

how close is the token to the exemplars already stored?
similarity to any single stored exemplar can be computed as its distance from the exemplar in the parameter space

↓

exact operationalisation

👁 ↓

A fixed size neighborhood around the new token determines the set of exemplars which influence the classification.
The summed similarities to the exemplars for each label instantiated in that neighborhood is computed, with the similarity to each given exemplar weighted by the strength (or activation) of that exemplar
- Recall that the strength is a function of the number and recency of phonetic tokens at that location in the exemplar space.
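The classification steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the exponential similarity function, the `window` and `scale` values, and the `(location, label, strength)` tuple layout are all choices made here for illustration.

```python
import math


def similarity(x, y, scale=0.05):
    """Similarity falls off exponentially with distance (assumed functional form)."""
    return math.exp(-abs(x - y) / scale)


def classify(token, exemplars, window=0.1):
    """Label a new token from stored exemplars.

    exemplars: list of (location, label, strength) tuples.
    Only exemplars within a fixed-size window around the token count.
    Returns the label with the highest strength-weighted summed similarity,
    or None when no exemplar falls inside the window.
    """
    scores = {}
    for location, label, strength in exemplars:
        if abs(location - token) <= window:
            scores[label] = scores.get(label, 0.0) + strength * similarity(token, location)
    if not scores:
        return None
    return max(scores, key=scores.get)


# Invented one-dimensional exemplar space with two labels.
exemplars = [(0.40, "A", 1.0), (0.42, "A", 1.0), (0.45, "A", 0.5), (0.50, "B", 1.0)]
print(classify(0.46, exemplars), classify(0.56, exemplars), classify(0.90, exemplars))
```

In this toy space the token at 0.46 is closer to the single "B" exemplar than to most "A" exemplars, yet "A" wins because three exemplars contribute to its summed score, which is the frequency advantage in miniature.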

p. 142

(Simplified) example of exemplar space

We note also that attentional weights may be imposed to model how different contexts, expectations, and task requirements influence classification; however these effects are not at issue in the present paper.

Influence of frequency

frequency

holds an advantage in the ‘categorisation competition’
high frequency labels are associated with more numerous exemplars → more dense and more activated exemplar clouds
⇒ high-frequency labels have a higher probability of being selected

p. 143

Frequency is not overtly encoded in the model. Instead, it is intrinsic to the cognitive representations for the categories. More frequent categories have more exemplars and more highly activated exemplars than less frequent categories.

Influence of decay

label strength

influenced by number / frequency (👁 ↑), but also activation level!
the more recent, the more activated an exemplar will be

Advantages of the model

1. fine-grained

shows detailed phonetic knowledge that speakers are assumed to have

2. prototype effects

a new token which is well-positioned with respect to a category can actually provide a better example of that category

p. 144

3. extreme examples are well judged

logical: maximal distance from competing labels

4. foundation for modelling frequency effects

frequency is intrinsic to the mechanism by which memories of categories are stored and new examples are classified

Production

How to extend our model of 👁 ↑ to production?

Model 1

Starting production

start of production

= the activation of a specific label

↓

p. 145

exemplar selection

‘random’ selection from that given label
here as well: strength of an exemplar is decisive in decision making

[A] phonetic target is not necessarily achieved exactly. Even for a speaker who is merely talking to himself, one may assume random deviations from the phonetic target due to noise in the motor control and execution.
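A rough sketch of this production step, looped so that each produced token is stored again as a new exemplar. It assumes a one-dimensional phonetic parameter, a single seed exemplar at 1.0, equal strengths for all new tokens, and no memory decay; the function names and the noise level are hypothetical.

```python
import random
import statistics


def produce(cloud, noise_sd, rng):
    """Pick a stored exemplar with probability proportional to its strength,
    then add Gaussian production noise to the selected target."""
    locations = [loc for loc, _ in cloud]
    strengths = [strength for _, strength in cloud]
    target = rng.choices(locations, weights=strengths, k=1)[0]
    return target + rng.gauss(0.0, noise_sd)


def production_perception_loop(n_tokens, noise_sd=0.1, seed=0):
    """Each produced token is stored as a new exemplar of the same label."""
    rng = random.Random(seed)
    cloud = [(1.0, 1.0)]  # one seed exemplar at 1.0 with strength 1.0
    for _ in range(n_tokens):
        cloud.append((produce(cloud, noise_sd, rng), 1.0))
    return [loc for loc, _ in cloud]


tokens = production_perception_loop(1000)
print(len(tokens), statistics.pstdev(tokens))
```

Because every stored token becomes a possible target for later productions, the noise accumulates, so the spread of the cloud grows with the number of productions.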

p. 146

How sampling is thought to work

1 = ‘prototypical’ sample
different distributions show the label’s distribution after 𝑛 productions

The overall shape approaches a Gaussian distribution as the number of tokens increases. This limiting behavior arises from the fact that the production-perception loop is an additive random process.

Don’t forget that William Kretzschmar doesn’t support this model because it has the wrong shape (should be an A-curve)!

Model II: systematic bias

Introducing hypo-articulation

systematic bias

both hypo- and hyper-articulation

↓

p. 147

hypo-articulation

the tendency to undershoot articulatory targets
to save effort and speed up communication

hyper-articulation
(assumed, not explained)

the tendency to overshoot articulatory targets
“trying too hard”

How sampling is thought to work (systematic bias)

Remember, this is not an A-curve.

↳ bias?

leftward bias of -0.01
each token is produced slightly lenited compared to the selected exemplar of the category
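The effect of the bias can be sketched by adding a constant offset to every production. The bias of -0.01 comes from the notes; everything else is an assumption made for illustration, in particular the `memory` parameter, which keeps only the most recent exemplars as a crude stand-in for memory decay, and the noise level.

```python
import random
import statistics


def lenition_loop(n_tokens, bias=-0.01, noise_sd=0.1, memory=200, seed=0):
    """Production-perception loop with a constant leftward (leniting) bias.

    Only the `memory` most recent exemplars can be selected as targets
    (an assumed simplification of decaying activation).
    """
    rng = random.Random(seed)
    cloud = [1.0]  # seed exemplar
    for _ in range(n_tokens):
        target = rng.choice(cloud[-memory:])
        cloud.append(target + bias + rng.gauss(0.0, noise_sd))
    return cloud


cloud = lenition_loop(5000)
# Compare the mean of the earliest and the latest 200 tokens.
print(statistics.mean(cloud[:200]), statistics.mean(cloud[-200:]))
```

Setting `noise_sd=0.0` isolates the bias: every token is then exactly 0.01 below the exemplar it was copied from, so the cloud can only move leftward.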

p. 147-148

Lindblom is claiming that speakers undershoot targets to the extent possible—e.g. to an extent that still permits communication. It would not be consistent with Lindblom’s general line of thought to think that speakers underarticulate to the point that their target words become unrecoverable.

p. 148

Diachronic and synchronic interpretations of the model

One way to view this figure is diachronically. It shows how the distribution of a category evolves over time after a leniting historical change is first introduced. The mode of the distribution gradually moves towards the left (or lenited) end of the phonetic axis.

The graph also has a synchronic interpretation, provided that we add a key assumption—namely, that not just phonemes, but individual words, have associated exemplar clouds.
For example, we assume that each of the words bet, bed, and bend has an exemplar cloud, and that the exemplar cloud for the phoneme /ɛ/ is the union of the /ɛ/ sections of the exemplar clouds for these words and for all other words containing an /ɛ/.
With this added assumption, the figure may be viewed as displaying a synchronic comparison amongst words of different frequencies which are impacted by the same historical change in progress.
- Since the high frequency words are used more often than the low frequency words, their stored exemplar representations show more numerous impacts of the persistent bias towards lenition.
- As a result, they are further to the left on the axis than the low frequency words.

Merits of the model

Detailed necessary predictions

Each individual word displays a certain amount of variability in production.
The effect of word frequency on lenition rates is gradient.
The effect of word frequency on lenition rates should be observable within the speech of individuals; it is not an artifact of averaging data across the different generations which make up a speech community.
The effect of word frequency on lenition rates should be observable both synchronically (by comparing the pronunciation of words of different frequency) and diachronically (by examining the evolution of word pronunciations over the years within each person’s speech.)

p. 149

Cognitive interpretations

1. shifting patterns in new speech environments

speakers immersed in a new speech environment find that their pronunciation patterns shift over a relatively long time span
- e.g. several months or more

The time span for historical changes is on the order of decades or more. Thus, the extremely high number of iterations used in making the calculations in the figures is not unrealistic. Consider, for example, a leniting change affecting the vowel in the preposition of. The present paper alone has over 200 examples of this word, and 10,000 examples would probably occur in less than one month of speech.

2. historical changes impact the speech of older people less than younger people

possible explanations
1. older people may have more exemplars than younger ones for the same pattern
  - more difficult to sway in a different direction
2. older people are less likely to add new exemplars than young ones
  - the formation of new memories becomes less rapid and robust with age

Model III: entrenchment

Production noise

problem with production noise

the two previous figures had a ‘serious problem’!
in a model with production noise, the variance for any given category steadily increases with usage
⇒ (basically, you get endless entropy if you always allow for new variation)

↕

real life situation

often: practice has the opposite effect

Entrenchment

entrenchment

phonetic variability associated with a typical phonological category decreases gradually

↓

how do we also model entrenchment?

inspiration

Rosenbaum et al. (1993)

neighbourhood selection

~~selection of a single exemplar~~ → selection of a target location, and then a neighbourhood around that location
all exemplars within the neighbourhood contribute towards the realisation

The neighbourhood is defined by a number of exemplars (the n nearest to the target location), not by a fixed distance in the parameter space.

The neural interpretation of this proposal is that a region in the brain, not merely a single point, is activated when planning a production. Activation-weighted averaging over a group of exemplars results in entrenchment, because averaging mathematically causes reversion towards the mean of a distribution.
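Activation-weighted averaging over the nearest exemplars can be sketched as below. This is a minimal sketch, assuming a one-dimensional parameter space and `(location, strength)` pairs; the function names, the noise level, and the toy cloud are hypothetical.

```python
import random


def entrenched_target(cloud, n_neighbours, rng):
    """cloud: list of (location, strength) pairs.

    Pick a target location (weighted by strength), take the n exemplars
    nearest to it, and return their strength-weighted mean location.
    """
    locations = [loc for loc, _ in cloud]
    strengths = [strength for _, strength in cloud]
    centre = rng.choices(locations, weights=strengths, k=1)[0]
    nearest = sorted(cloud, key=lambda e: abs(e[0] - centre))[:n_neighbours]
    total_strength = sum(strength for _, strength in nearest)
    return sum(loc * strength for loc, strength in nearest) / total_strength


def produce(cloud, n_neighbours, rng, bias=-0.01, noise_sd=0.1):
    """Averaged target plus the systematic bias and production noise."""
    return entrenched_target(cloud, n_neighbours, rng) + bias + rng.gauss(0.0, noise_sd)


rng = random.Random(0)
cloud = [(0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
# With all three exemplars in the neighbourhood the target is their weighted mean.
print(entrenched_target(cloud, 3, rng))  # prints 1.25
```

With `n_neighbours=1` the function reduces to the single-exemplar selection of the earlier models; larger neighbourhoods pull each target towards the local mean, which is what counteracts the spreading caused by noise.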

How sampling is thought to work (systematic bias + entrenchment)

↳ the entrenchment narrows the distributions, so that the distribution width for the case of 100,000 iterations is roughly comparable to that for 10,000 iterations
↳ spreading effects arising from production noise and lenition and the anti-diffusive effect of entrenchment have essentially cancelled out in determining the variance

p. 151

Entrenchment: other options

Hintzman/Goldinger model

puts entrenchment in perception

↓

entrenchment in perception

storage of an exemplar is skewed by the information which was decisive in categorising that exemplar
⇒ reversion towards the mean

↓

1. influence of neighbourhood

if the neighbourhood is sparsely populated, the pull towards the mean will be low
⟷ Pierrehumbert model: no such thing

We were unable to make a fixed neighborhood work out in the production model since it creates too much instability in the exemplar dynamics at the beginning of the calculation when there are very few examples of a category. This is why an n-nearest-neighbors model is offered here. An integrated model which handles all known neighborhood effects simultaneously remains to be developed.

2. feedback from other levels

people sharpen categories faster and to a greater degree if they receive feedback
- particularly if the feedback provides functionally important rewards or penalties

Speech patterns appear to fall into an intermediate situation, in that people adapt their speech patterns to their speech community even without overt pressures and rewards, but that communicative success and social attunement provide implicit feedback which is certainly important.

p. 152

The model presented here does have feedback, in that it has an informational loop between the stimulus encoding and the abstract level of representation represented by the labelling. If an incoming stimulus is so ambiguous that it can’t be labelled, then it is ignored rather than stored. That is, the exemplar cloud is only updated when the communication was successful to the extent that the speech signal was analyzable.

In addition, the model automatically generates social accommodation of speech patterns, since speech patterns which are heard recently and frequently dominate the set of exemplars for any given label, and therefore guide the typical productions.

Neutralisation

stability

how can ‘drift’ of a certain category come to a halt?

↓

To model this situation, we need to look at two labels which are competing over a phonetic parameter range. We consider the case of a marked phonological category competing with an unmarked one. Following Greenberg and others, we take the unmarked category to be more frequent than the marked one (see papers in Greenberg et al. 1978). In the calculation presented, the unmarked category is three times as frequent as the marked one. The marked category is also the phonetically unstable one which is subject to a persistent bias. The unmarked one is assumed to be phonetically stable.

p. 153

the right hand distribution represents the marked category which is subject to a persistent leftwards bias
the left hand distribution is a stable unmarked distribution competing for labelling of the same phonetic parameter
the successive panels represent four time slices in the evolution of the situation
1. Because the marked distribution is subject to a persistent bias, it drifts to the left
2. When it approaches the unmarked distribution, some individual tokens which were intended as examples of the marked case are perceived and stored as examples of the unmarked case.
3. This happens more often than the reverse. Insofar as it does happen, the disproportion in frequency between the two categories increases.
4. In the end, the marked category is completely gobbled up by the unmarked one.

Note that the distribution of the unmarked category does show some influence of the marked category it absorbed. Although the location of the distribution is still closer to the original location of the unmarked category than that of the marked category, the mode of the distribution is a bit to the right from where it was.

p. 154

Conclusion

exemplar dynamics

a good model for usage-based phonology
explains incremental modifications

Appendix

TODO for later

p. 229

Probabilistic relations between words: evidence from reduction in lexical production

Introduction

Frequency and probability

frequency?

popular enough in models of language processing

↓

probabilistic information

only recently (2001) thought to play a role

p. 230

↳ goal

understand many factors influencing production variability (reduction, shortening, deletion)
so both frequency and probabilistic information

↓

Probabilistic Reduction Hypothesis

word forms are reduced when they have a higher probability

Probability of a word is conditioned on many aspects of its context, including neighboring words, syntactic and lexical structure, semantic expectations, and discourse factors

This proposal thus generalizes over earlier models which refer only to word frequency (Zipf 1929; Fidelholtz 1975; Rhodes 1992; Rhodes 1996) or predictability (Fowler and Housum 1987).

The role of local probabilistic relations between words

probabilistic relations between words

words which are strongly related to or predictable from neighboring words are more likely to be phonologically reduced
e.g. collocations (sequences of commonly cooccurring words)

↓ consequences

1. evidence for emergent linguistic structure

(my note) grammar arises from use

2. probabilistic relations are represented in the mind of the speaker

(my note) transitions between words are stored in the speaker’s mind

↳ results

the hypothesis is supported
more probable words are more likely to be reduced
⇒ probabilistic relations between words must play a role in the mental representation of language

Measures of probabilistic relations between words

Definition

Probabilistic Reduction Hypothesis

word forms are reduced when they are predictable or probable

p. 231

Formal measures

Single word

prior probability of a word

the probability without considering any contextual factors
‘prior’ to seeing any other information

Prior probability

$$P\left(w_i\right)=\frac{C\left(w_i\right)}{\sum_j C\left(w_j\right)}=\frac{C\left(w_i\right)}{N}$$

the frequency of the word divided by the total number of word tokens

Previous word

joint probability of a word and the previous word

the prior probability of the two words taken together

Joint probability

$$P\left(w_{i-1} w_i\right)=\frac{C\left(w_{i-1} w_i\right)}{N}$$

estimated by just looking at the relative frequency of the two words together in a corpus

conditional probability of a word given the previous word /
transitional probability

the probability of a word given the previous word

Conditional probability

$$P\left(w_{i} \mid w_{i-1}\right)=\frac{C\left(w_{i-1} w_i\right)}{C\left(w_{i-1}\right)}$$

counting the number of times the two words occur together $C\left(w_{i-1} w_i\right)$ , and dividing by $C\left(w_{i-1}\right)$ , the number of times that the first word occurs
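The three measures defined so far (prior, joint with the previous word, conditional on the previous word) can be computed from raw counts as sketched below. The toy corpus and the function names are invented for illustration; the denominators follow the formulas in these notes.

```python
from collections import Counter


def ngram_counts(tokens):
    """Return unigram and bigram counts for a list of word tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def prior(word, unigrams):
    """P(w_i) = C(w_i) / N."""
    return unigrams[word] / sum(unigrams.values())


def joint(prev, word, unigrams, bigrams):
    """P(w_{i-1} w_i) = C(w_{i-1} w_i) / N."""
    return bigrams[(prev, word)] / sum(unigrams.values())


def conditional_given_previous(word, prev, unigrams, bigrams):
    """P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})."""
    return bigrams[(prev, word)] / unigrams[prev]


# Invented ten-token corpus.
tokens = "the cat sat on the mat and the cat slept".split()
unigrams, bigrams = ngram_counts(tokens)
print(prior("the", unigrams))                                    # 3 / 10
print(joint("the", "cat", unigrams, bigrams))                    # 2 / 10
print(conditional_given_previous("cat", "the", unigrams, bigrams))  # 2 / 3
```

The conditional measure raises `ZeroDivisionError` when the conditioning word never occurs; a real analysis would need to decide how to handle unseen words.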

p. 232

Difference between conditional and joint probability?

The conditional probability controls for the frequency of the conditioning word. For example, pairs of words can have a high joint probability merely because the individual words are of high frequency (e.g., of the). The conditional probability would be high only if the second word was particularly likely to follow the first.
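
The relation between the two measures follows directly from the definitions above: dividing numerator and denominator of the conditional probability by N shows that it is the joint probability normalised by the prior probability of the conditioning word. A short worked derivation, using only the symbols already defined in these notes:

```latex
% Conditional probability rewritten in terms of joint and prior probability
P\left(w_i \mid w_{i-1}\right)
  =\frac{C\left(w_{i-1} w_i\right)}{C\left(w_{i-1}\right)}
  =\frac{C\left(w_{i-1} w_i\right) / N}{C\left(w_{i-1}\right) / N}
  =\frac{P\left(w_{i-1} w_i\right)}{P\left(w_{i-1}\right)}
```

So a pair like "of the" can have a high joint probability while its conditional probability stays moderate, because the denominator $P\left(w_{i-1}\right)$ is itself large for a very frequent first word.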

Next word

joint probability of a word and the next word

the prior probability of the two words taken together

Joint probability

P\left(w_i w_{i+1}\right)=\frac{C\left(w_i w_{i+1}\right)}{N}

estimated by just looking at the relative frequency of the two words together in a corpus

conditional probability of a word given the next word /
transitional probability

the probability of a word given the next word

Conditional probability

P\left(w_{i} \mid w_{i+1}\right)=\frac{C\left(w_i w_{i+1}\right)}{C\left(w_{i+1}\right)}

counting the number of times the two words occur together, $C\left(w_i w_{i+1}\right)$, and dividing by $C\left(w_{i+1}\right)$, the number of times that the following word occurs
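
The next-word version only changes the denominator: the bigram count is divided by the count of the following word instead of the preceding one. A minimal sketch, in which `conditional_given_next` and the toy sentence are my own assumptions:

```python
from collections import Counter

def conditional_given_next(tokens):
    """Return P(w_i | w_{i+1}) keyed by (target, next) pairs."""
    unigrams = Counter(tokens)                  # C(w)
    bigrams = Counter(zip(tokens, tokens[1:]))  # C(w_i w_{i+1})
    # Divide by the count of the FOLLOWING word, C(w_{i+1})
    return {(target, nxt): c / unigrams[nxt]
            for (target, nxt), c in bigrams.items()}

# Hypothetical toy corpus, not data from the chapter.
tokens = "the united states and the united nations".split()
cond_next = conditional_given_next(tokens)
print(cond_next[("united", "states")], cond_next[("and", "the")])
```

In the toy sentence "states" is always preceded by "united", so "united" is fully predictable from the next word, even though "united" itself is followed by two different words.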

Trigram probability

probability given two surrounding words

probability of the target given one word preceding and one word following the target

Neighbourhood probability

P\left(w_i \mid w_{i-1} \ldots w_{i+1}\right) = \frac{C\left(w_{i-1} w_i w_{i+1}\right)}{C\left(w_{i-1} \ldots w_{i+1}\right)}
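
A sketch of this measure, assuming the denominator counts every occurrence of the frame "previous word, any one word, next word". The helper name `conditional_given_surrounding` and the toy corpus are hypothetical.

```python
from collections import Counter

def conditional_given_surrounding(tokens):
    """Return P(w_i | w_{i-1} ... w_{i+1}) keyed by (previous, target, next)."""
    windows = list(zip(tokens, tokens[1:], tokens[2:]))
    trigrams = Counter(windows)                               # C(w_{i-1} w_i w_{i+1})
    frames = Counter((prev, nxt) for prev, _, nxt in windows) # C(w_{i-1} ... w_{i+1})
    return {tri: c / frames[(tri[0], tri[2])] for tri, c in trigrams.items()}

# Hypothetical toy corpus, not data from the chapter.
tokens = "a little bit more a little bit more a tiny bit more".split()
p_surround = conditional_given_surrounding(tokens)
print(p_surround[("a", "little", "bit")])  # 2 of the 3 "a _ bit" frames
```

Here the frame "a _ bit" occurs three times, filled twice by "little" and once by "tiny".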

p. 233

Summary of probabilistic measures and high probability examples
measure	formula	examples
relative frequency	$P(w_i)$	just, right
joint of target with next word	$P(w_i w_{i+1})$	kind of
joint of target with previous word	$P(w_{i-1} w_i)$	a lot
conditional of target given previous	$P(w_i \mid w_{i - 1})$	Supreme Court
conditional of target given next	$P(w_i \mid w_{i + 1})$	United States
conditional of target given surrounding	$P(w_i \mid w_{i - 1} \ldots w_{i + 1})$	little bit more

If one wishes to pick a single measure of probability for convenience in reporting, it makes sense to pick one which combines several independent measures, such as mutual information (which combines the joint, the relative frequency of the target, and the relative frequency of the neighboring word) or conditional probability (which combines joint probability and the relative frequency of the neighboring word). We chose conditional probability because for this particular data set it was a better single measure than joint probability.
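
The quote does not spell out the mutual information formula. Assuming the usual pointwise definition is meant, it can be written with the symbols from these notes as follows (shown for the previous word; the next-word version replaces $w_{i-1}$ with $w_{i+1}$):

```latex
% Pointwise mutual information between the target and the previous word
% (standard definition, assumed here; not quoted from the chapter)
I\left(w_{i-1} ; w_i\right)
  =\log_2 \frac{P\left(w_{i-1} w_i\right)}{P\left(w_{i-1}\right) P\left(w_i\right)}
```

This makes the "combination" visible: the numerator is the joint probability, and the denominator multiplies the relative frequencies of the neighboring word and the target. Dropping $P\left(w_i\right)$ from the denominator leaves the conditional probability.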

p. 234

Effects of predictability on function words

Our first experiment studied the ten most frequent English function words in the Switchboard corpus. (These are also the ten most frequent words in the corpus.)

The function word dataset

dataset

the ten most frequent English function words
I, and, the, that, a, you, to, of, it, and in

p. 235

Regression analysis

analysis

multiple regression

p. 236

Control factors

(skipped)

p. 237

Results

Vowel reduction in function words

conditional probability given previous word

added to the regression equation
was significant!
⇒ the higher the conditional probability of the target given the previous word, the greater the expected likelihood of vowel reduction in the function word target

The predicted likelihood of a reduced vowel in words which were highly predictable from the preceding word (at the 95th percentile of conditional probability) was 48 percent, whereas the likelihood of a reduced vowel in low predictability words (at the 5th percentile) was 24 percent.

p. 238

conditional probability given next word

also added to regression equation
was also significant!
⇒ the higher the conditional probability of the target given the next word, the greater the expected likelihood of vowel reduction in the function word target

The predicted likelihood of a reduced vowel in words which were highly predictable from the following word (at the 95th percentile of conditional probability) was 42 percent, whereas the likelihood of a reduced vowel in low predictability words (at the 5th percentile) was 35 percent. Note that the magnitude of the effect was a good deal weaker than that with the previous word.
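
To make "a good deal weaker" concrete, the two pairs of reported likelihoods (48 vs. 24 percent for the previous word, 42 vs. 35 percent for the next word) can be compared with some simple arithmetic. The summary statistics below are my own choice of illustration, not measures reported in the chapter.

```python
def compare(high, low):
    """Summarise a pair of predicted likelihoods given as proportions."""
    def odds(p):
        return p / (1 - p)
    return {
        "difference_points": round((high - low) * 100, 1),  # percentage points
        "ratio": round(high / low, 2),                       # relative likelihood
        "odds_ratio": round(odds(high) / odds(low), 2),
    }

previous_word = compare(0.48, 0.24)  # 95th vs. 5th percentile, previous word
next_word = compare(0.42, 0.35)      # 95th vs. 5th percentile, next word
print(previous_word)
print(next_word)
```

On these numbers, reduction is twice as likely at the high end of predictability from the previous word, against a ratio of 1.2 for the next word.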

conditional probability given two surrounding words

small, additional significant effect of preceding and following words together

Function word duration

durational shortening

significant effect of previous and next word
(also of previous and next together)

Independence of duration and vowel reduction

reduction and probability:
a categorical choice?

either full or reduced

↓

no!

effect of predictability on shortening is a gradient, non-categorical one

p. 239

It is possible, however, that the shortening effects that we observe for function words might be solely a consequence of the vowel reduction effects, since reduced vowels are indeed durationally shorter than full vowels. If shortening was only a consequence of vowel selection, there might be no evidence for a gradient effect of probability on reduction.

additional testing and results

predictability not only affects vowel reduction, but has an additional independent non-categorical effect on word duration

The function word dataset: discussion

The results for the function word dataset show that function words that are more predictable are shorter and more likely to have reduced vowels, supporting the Probabilistic Reduction Hypothesis. The conditional probability of the target word given the preceding word and given the following one both play a role, on both duration and deletion. The magnitudes of the duration effects are fairly substantial, in the order of 20 ms or more, or about 20 percent, over the range of the conditional probabilities (excluding the highest and lowest five percent of the items).

p. 239-240

Under one possible model of these effects, the categorical vowel reduction effects could be the result of lexicalization or grammaticalization leading to segmental changes in the lexicon or grammar, while the continuous duration effects are on-line effects, perhaps mediated in part by prosodic structure, but not represented in lexicalized differences.

p. 240

Lexical versus collocation effects

The problem of collocations

problem

many of these pairs (like sort of or kind of) might be single lexical items rather than word pairs (sorta, kinda)

The classification as high-probability word pairs would then stem from the fact that we rely on a purely orthographic definition of a word.

↳ question

are our results purely lexical, rather than syntactic (word-order)?

Solution

solution

show that higher predictability is associated with increased reduction even in word combinations that are not lexicalized
observations → split into two groups: high and low conditional probabilities
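
The split into two groups can be sketched as below. I assume a median split over the observations, which these notes do not confirm; the chapter only speaks of the lower half of the probability range. The word pairs and values are hypothetical placeholders, not Switchboard figures.

```python
from statistics import median

def split_by_probability(observations):
    """Split (word_pair, conditional_probability) observations at the median."""
    cutoff = median(p for _, p in observations)
    low = [obs for obs in observations if obs[1] <= cutoff]
    high = [obs for obs in observations if obs[1] > cutoff]
    return low, high

# Hypothetical observations: (word pair, conditional probability of the target).
observations = [
    ("kind of", 0.62), ("sort of", 0.55), ("of the", 0.21),
    ("in a", 0.04), ("and it", 0.02), ("to you", 0.01),
]
low, high = split_by_probability(observations)
print([pair for pair, _ in low])
print([pair for pair, _ in high])
```

The analysis would then be rerun on each group separately: if predictability still predicts reduction in the low group, the effect cannot be due to lexicalized pairs alone.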

p. 241

The ten most probable function word sequences in context from the lower half of the probability range, according to two probability measures. Function words in this lower range did show effects of durational shortening due to higher probability.

Considering first the effects of the preceding word, we found that there was no significant effect of conditional probability on vowel reduction in the low group, but there was a significant effect of conditional probability in the high group.
These results lend some support for the influence of lexicalization.
For duration, however, conditional probability of the preceding word had a significant effect for both groups, although it did appear to be somewhat stronger for the high group.

The results for following word effects did not support the lexicalization hypothesis. Conditional probability of the following word was just as good a predictor of vowel reduction in the low probability group as in the high probability group.

↳ still affected

p. 242

Conclusions

More predictable words are more reduced, even if they are in a low probability group and unlikely to be lexically combined with a neighboring word
- clear evidence for probabilistic relations between words
Particularly for the predictability from the previous word, the high group shows a stronger effect of predictability on reduction
- suggests there is some reduction in duration due to the lexicalization of word pairs

Effects of predictability on final-t/d content words

Do probabilistic relations also hold for content words?

The final-t/d content word dataset

Variables

1. deletion of final consonant

final t/d deletion is defined as the absence of a pronounced oral stop segment corresponding to a final t or d in words

2. duration in milliseconds

duration of the word in milliseconds

p. 243

Control factors

(skipped)

p. 244

Results

Duration

relative frequency

strong effect of the relative frequency of the target word

Overall, high frequency words (at the 95th percentile of frequency) were 18% shorter than low frequency words (at the 5th percentile).

conditional probability next word

conditional probability of the target given the next word significantly affected duration
⇒ more predictable words were shorter

Words with high conditional probability (at the 95th percentile of the conditional probability given the next word) were 12% shorter than low conditional probability words (at the 5th percentile).

conditional probability previous word

conditional probability of the target given the previous word significantly affected duration
⇒ more predictable words were shorter

Also significant for the joint probability with previous and next word!

Deletion

relative frequency

strong effect of the relative frequency of the target word

High frequency words (at the 95th percentile) were 2.0 times more likely to have deleted final t or d than the lowest frequency words (at the 5th percentile).

p. 245

conditional probability next word

did not significantly affect deletion

We had found in earlier work (Gregory et al. 1999) that deletion was not sensitive to predictability effects from the following word. This result was confirmed in our current results. Neither the conditional probability of the target word given the next word nor the relative frequency of the next word predicted deletion of final t or d.

Final-t/d content word dataset: discussion

1. content words with higher relative frequencies
(= prior probabilities)

are shorter and more likely to have deleted final t or d
(than content words with lower relative frequencies)

The effect of target word frequency was the strongest overall factor affecting reduction of content words, and provides support for the Probabilistic Reduction Hypothesis

2. content words with high conditional probability

given previous word: more likely to be shorter
- but: not more likely to undergo final segment deletion

3. comparison with function words

effects for function words are much stronger

Failure to find effects may be due to the smaller number of observations in the content word dataset or the general lower frequencies of content words.

4. previous-word relative frequency

only measure with an effect on deletion
high-frequency previous words led to longer target forms and less final-t/d deletion

p. 246

Another possibility is that the lengthening of content words after frequent previous words is a prosodic effect. For example, if the previous word is frequent, it is less likely to be stressed or accented, which might raise the probability that the current word is stressed or accented, and hence that it is less likely to be reduced.

(this is what I would instinctively guess as well)

Prosodic effects might also explain the asymmetric effect of surrounding words (i.e. following words played little role in final deletion). This likely illustrates that not all reduction processes are affected in the same way by probabilistic variables.

Conclusion

general conclusion

we find evidence for the Probabilistic Reduction Hypothesis
more probable words are reduced, whether they are content or function words
